Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 1 09:03:52 2023"
I have some experience with R and other languages like Python. The most challenging aspect so far is getting used to the logic of how RStudio and Git interact (i.e. pushing the changes to my repository and getting the diary to update).
I went over Exercise 1 briefly and found its length overwhelming. I am not sure whether we are expected to work through all of it (if we are, I don't think I will be able to keep up).
Nevertheless, I am still up for the challenge and will just try and absorb as much as I can.
Here is the link to my repository: https://github.com/nicolasyeung/IODS-project
This week I learned quite a bit about data wrangling and performed some basic linear regression.
date()
## [1] "Fri Dec 1 09:03:52 2023"
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
students2014 <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/learning2014.csv")
str(students2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
ggpairs(students2014, columns = 3:8)
first_model <- lm(formula= Points ~ attitude + deep + Age + surf + stra, data=students2014)
summary(first_model)
##
## Call:
## lm(formula = Points ~ attitude + deep + Age + surf + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4759 -3.2812 0.3804 3.5381 11.4897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30059 5.24076 3.492 0.00062 ***
## attitude 3.43200 0.56991 6.022 1.14e-08 ***
## deep -1.04374 0.78179 -1.335 0.18375
## Age -0.09650 0.05335 -1.809 0.07234 .
## surf -1.10529 0.84009 -1.316 0.19016
## stra 0.96551 0.53917 1.791 0.07523 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.249 on 160 degrees of freedom
## Multiple R-squared: 0.2311, Adjusted R-squared: 0.2071
## F-statistic: 9.62 on 5 and 160 DF, p-value: 4.803e-08
second_model <- lm(formula= Points ~ attitude + deep + surf + stra, data=students2014)
summary(second_model)
##
## Call:
## lm(formula = Points ~ attitude + deep + surf + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7335 -3.1992 0.7975 3.4002 11.6582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.4000 5.0245 3.065 0.00255 **
## attitude 3.4362 0.5739 5.987 1.34e-08 ***
## deep -1.0072 0.7870 -1.280 0.20247
## surf -0.9105 0.8390 -1.085 0.27948
## stra 0.8847 0.5411 1.635 0.10398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.286 on 161 degrees of freedom
## Multiple R-squared: 0.2154, Adjusted R-squared: 0.1959
## F-statistic: 11.05 on 4 and 161 DF, p-value: 6.057e-08
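The two fits above are nested (they differ only in whether Age is included), and nested linear models can also be compared formally with an F test via `anova()`. A minimal, self-contained sketch on the built-in mtcars data — the variables here are stand-ins, not the course data:

```r
# Compare a full and a reduced nested model with an F test
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)  # a small p-value argues for keeping qsec
```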
plot(first_model, which = 1, caption = "Residuals vs Fitted")
plot(first_model, which = 2)
plot(first_model, which = 5)
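The three diagnostic plots above can also be viewed together in one panel; a minimal sketch, again using the built-in mtcars data as a stand-in for students2014:

```r
# View all four standard lm diagnostic plots in one 2x2 panel
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  # reset the plotting layout
```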
library(tidyverse)
library(dplyr)
library(ggplot2)
alc <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/data/Assignement3_Logistic_Regression/alc.csv")
g1 <- ggplot(data = alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("famrel")
#### Interesting: there seems to be a trend here that better family relationships correlate with lower alcohol consumption, but it looks weak. My prediction is that if I use famrel as a variable in my logistic regression, it will not be a strong predictor.
g2 <- ggplot(data = alc, aes(x = high_use, y = goout))
g2 + geom_boxplot() + ylab("goout")
#### Here my guess that people who go out more also drink more appears to hold. By eye, this looks slightly stronger than famrel, so I would expect it to correlate with high_use being true.
ggplot(data=alc, aes(x=factor(failures),y=alc_use))+
geom_boxplot()+
geom_hline(yintercept = 2, linetype="dashed", colour="red")
#### Just looking at the distribution of failures vs alc_use above two, we see that everyone who failed a course also falls into the high_use TRUE condition.
g3 <- ggplot(data = alc, aes(x = high_use, y = higher))
g3 + geom_count(aes(high_use, higher, alpha = 0.7, colour = higher))
#### Here I had a hard time finding a plot that would show the comparison I wanted. One of my learning pains in data visualization is finding the right plot type. Still, I can see that among people without higher-education aspirations there isn't much of a difference in high_use. Among those who do aspire to higher education (who also make up the majority of the students), more fall outside high_use than inside it, which matches my guess, though perhaps not strongly.
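When both variables are categorical, a cross-tabulation is often easier to read than a plot. A small self-contained sketch with made-up factors (the names aspiration and heavy_use are illustrative, not columns of alc):

```r
# Row-wise proportions: how one factor splits within each level of another
aspiration <- factor(c("yes", "yes", "no", "yes", "no", "yes"))
heavy_use  <- factor(c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE))
tab <- prop.table(table(aspiration, heavy_use), margin = 1)
tab  # each row sums to 1
```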
g4 <- ggplot(data = alc, aes(x = higher, y = G3))
g4 + geom_boxplot() + ylab("G3")
#### I was interested to see whether the ambition for higher education also correlates with grades; it seems to. Interesting.
m <- glm(high_use ~ famrel + failures + goout + higher, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + failures + goout + higher,
## family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4620 0.8795 -1.662 0.0964 .
## famrel -0.3747 0.1364 -2.748 0.0060 **
## failures 0.4698 0.2283 2.058 0.0396 *
## goout 0.7576 0.1199 6.319 2.63e-10 ***
## higheryes -0.5394 0.6271 -0.860 0.3897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 387.67 on 365 degrees of freedom
## AIC: 397.67
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) famrel failures goout higheryes
## -1.4620255 -0.3747109 0.4698436 0.7575575 -0.5393577
OR <- coef(m) %>% exp
OR
## (Intercept) famrel failures goout higheryes
## 0.2317664 0.6874880 1.5997439 2.1330599 0.5831227
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) 0.03965521 1.2706808
## famrel 0.52457434 0.8971249
## failures 1.02980361 2.5365490
## goout 1.69686639 2.7177864
## higheryes 0.16737211 2.0213349
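The odds ratios and their confidence intervals can be collected into a single table for readability. A self-contained sketch on simulated data — the model here is illustrative, not the course model:

```r
# Collect odds ratios and profile CIs into one table
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
or_table <- cbind(OR = exp(coef(fit)), exp(confint(fit)))
or_table  # one row per coefficient: point estimate plus 2.5% / 97.5% bounds
```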
m <- glm(high_use ~ famrel + failures + goout, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use" "probability" "prediction"
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 238 21
## TRUE 75 36
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
geom_point()
#### Taking a look at the predictions visually.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2594595
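As a quick sanity check, the loss function can be exercised on tiny hand-made vectors:

```r
# The loss is the share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))  # 0.5: two of four wrong
```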
library(boot)
# leave-one-out cross-validation (K = number of observations)
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv_train$delta[1]
## [1] 0.2594595
cv_test$delta[1]
## [1] 0.272973
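The same leave-one-out vs 10-fold comparison can be reproduced end to end on simulated data; a self-contained sketch (the variable names are made up):

```r
library(boot)

# Simulate a binary outcome, fit a logistic model, then compare CV schemes
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x))
fit <- glm(y ~ x, data = d, family = "binomial")
loss <- function(class, prob) mean(abs(class - prob) > 0.5)
loo  <- cv.glm(d, fit, cost = loss, K = n)$delta[1]   # leave-one-out
cv10 <- cv.glm(d, fit, cost = loss, K = 10)$delta[1]  # 10-fold
c(loo = loo, cv10 = cv10)
```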
library(tidyverse)
library(dplyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
date()
## [1] "Fri Dec 1 09:04:02 2023"
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
Boston_melted <- melt(Boston)
## Using as id variables
ggplot(Boston_melted, aes(x = value)) +
geom_histogram(binwidth = 1, fill = "turquoise", color = "blue") +
facet_wrap(~variable, scales = "free") +
theme_minimal()
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle")
##### Here I can see some strong negative correlations: dis vs (indus, nox, age) and medv vs lstat. There are also some strong positive correlations: medv vs rm (which makes sense, as more rooms ideally mean a higher property value) and indus vs nox (more industry means higher relative air-pollution levels).
Boston_scaled <- as.data.frame(scale(Boston))
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
bins <- quantile(Boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(Boston_scaled$crim, breaks = bins,labels = c("low","med_low","med_high","high"), include.lowest = TRUE)
# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)
summary(Boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
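The quartile-based cut above can be verified on any numeric vector; a self-contained sketch on simulated data:

```r
# Quantile breaks split a continuous variable into four roughly equal classes
set.seed(10)
x <- rnorm(400)
bins <- quantile(x)
grp <- cut(x, breaks = bins,
           labels = c("low", "med_low", "med_high", "high"),
           include.lowest = TRUE)
table(grp)  # about a quarter of the observations per class
```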
# Test and train data sets:
n <- nrow(Boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- Boston_scaled[ind,]
# create test set
test <- Boston_scaled[-ind,]
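The 80/20 split above is random, so it changes on every knit; fixing the seed first makes it reproducible. A minimal sketch on a stand-in row count:

```r
set.seed(2023)                      # fix the RNG so the split is reproducible
n <- 150                            # stand-in for nrow(Boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train_rows <- ind
test_rows  <- setdiff(seq_len(n), ind)
c(train = length(train_rows), test = length(test_rows))  # 120 and 30
```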
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2425743 0.2400990 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 0.92832374 -0.9292454 -0.12090214 -0.8673259 0.47673441 -0.8580378
## med_low -0.05256228 -0.3205260 -0.03128211 -0.5993491 -0.05662749 -0.4327359
## med_high -0.38910901 0.1747716 0.17414622 0.3183751 0.15406998 0.4390995
## high -0.48724019 1.0170492 -0.12234430 1.0597397 -0.43500776 0.8263331
## dis rad tax ptratio black lstat
## low 0.8146597 -0.6826072 -0.7576318 -0.4213798 0.3806601 -0.784410661
## med_low 0.3847619 -0.5435787 -0.4936986 -0.0746698 0.3160699 -0.188104713
## med_high -0.3937597 -0.3981658 -0.3085988 -0.2694611 0.1329462 -0.006954313
## high -0.8662870 1.6388211 1.5145512 0.7815834 -0.7473673 0.922109621
## medv
## low 0.5684149
## med_low 0.0423658
## med_high 0.2151257
## high -0.7209771
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09854238 0.64965671 -0.99083421
## indus 0.06875780 -0.35434023 0.37435424
## chas -0.06502962 -0.12468746 0.07549740
## nox 0.39079528 -0.68757926 -1.27019440
## rm -0.08983278 -0.07120282 -0.02331846
## age 0.30340719 -0.37028847 -0.46388028
## dis -0.04777085 -0.23897199 0.16103997
## rad 3.11180947 0.97381413 -0.21626736
## tax 0.04469487 -0.04288839 0.65215992
## ptratio 0.08472165 0.06637226 -0.20656281
## black -0.12079331 0.01365856 0.10342180
## lstat 0.14890835 -0.10471377 0.55655487
## medv 0.14192747 -0.33765971 -0.13365979
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9521 0.0356 0.0124
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda (bi)plot
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 6 1 0
## med_low 1 21 6 0
## med_high 0 7 21 1
## high 0 0 0 22
##### Looking at the cross-tabulation, the model performs well on the high-crime class, while low, med_low, and med_high are each slightly off; med_low is the next-best-predicted crime level.
data("Boston")
Boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(Boston_scaled, method = "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#dist_man <- dist(Boston_scaled, method = "manhattan")
#dist_man
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
##### Looking at the elbow, it appears that 2 is the optimal number of clusters for k-means.
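The elbow read off the plot can also be checked numerically: the candidate k is where the drop in total WCSS flattens out. A self-contained sketch on two simulated clusters:

```r
# The elbow is where the decrease in total within-cluster sum of squares flattens
set.seed(123)
pts <- rbind(matrix(rnorm(100), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
twcss <- sapply(1:6, function(k) kmeans(pts, k, nstart = 10)$tot.withinss)
drops <- -diff(twcss)   # decrease in TWCSS gained by each added cluster
which.max(drops) + 1    # the largest drop lands at k = 2
```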
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
##### With all 14 variables plotted pairwise, this is hard to interpret at a glance; the two colours show which of the two clusters each observation was assigned to.
#Data
set.seed(32)
data("Boston")
boston_scaled <- scale(Boston) %>% as.data.frame()
# Set eight clusters:
km_bonus <- kmeans(Boston, centers = 8)
boston_scaled$cluster <- km_bonus$cluster
Boston_bonus <- lda(cluster ~ ., data = boston_scaled)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)
# plot the lda (bi)plot
plot(Boston_bonus, dimen =2)
lda.arrows(Boston_bonus, myscale = 1)
Boston_bonus
## Call:
## lda(cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6 7
## 0.27075099 0.08695652 0.15217391 0.12252964 0.10869565 0.05335968 0.18379447
## 8
## 0.02173913
##
## Group means:
## crim zn indus chas nox rm
## 1 1.0097765 -0.4872402 1.0662784 -0.04242540 0.9959393 -0.39626518
## 2 -0.4112593 1.8232528 -1.0587789 -0.27232907 -1.2032247 0.32608884
## 3 -0.4103240 0.4669153 -0.6939781 0.08558914 -0.7573615 0.40725570
## 4 -0.3115695 -0.4872402 0.6973678 0.10868063 0.6498658 -0.24065169
## 5 -0.3906267 -0.1882696 -0.4589025 -0.05757815 -0.6526048 0.01656427
## 6 -0.4105500 0.7276120 -0.7660389 -0.27232907 -0.8389328 0.12107492
## 7 -0.3726011 -0.2488802 -0.5624188 0.15101504 -0.2307439 0.26980542
## 8 -0.1918628 -0.4872402 0.8121161 0.08558914 1.3206359 -0.52452958
## age dis rad tax ptratio black
## 1 0.7599946 -0.82659650 1.5757732 1.53915759 0.8040926 -0.7189340
## 2 -1.4995727 1.76550267 -0.6321108 -0.51640025 -0.7080115 0.3436424
## 3 -0.6511612 0.50825445 -0.6835681 -1.09300067 -0.2158123 0.3816348
## 4 0.7665217 -0.74688501 -0.5298939 0.01868990 -0.2268036 0.2441186
## 5 -1.0126162 0.40445826 -0.5725993 -0.71589771 -0.1533051 0.3386187
## 6 -1.1910518 0.90003854 -0.5735274 -0.05678564 -0.3404312 0.3375271
## 7 0.4650140 0.04164319 -0.5323637 -0.67571060 -0.2506439 0.3386363
## 8 0.8257272 -0.69874374 -0.5538062 -0.12654817 -0.6723188 -1.8525431
## lstat medv
## 1 0.8432167 -0.6807081
## 2 -0.8161977 0.3320129
## 3 -0.5638615 0.6660396
## 4 0.2594944 -0.1294832
## 5 -0.5252045 0.2809099
## 6 -0.5717207 0.2900036
## 7 -0.1610779 0.2011491
## 8 0.6385135 -0.5996044
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim 0.009913887 -0.137609161 -0.04036563 0.06710828 0.008706277
## zn -0.063869885 -0.359539354 -0.03252327 -0.37386866 -0.985470185
## indus -0.064553782 -0.195044951 -0.33201434 0.87866138 -0.853101757
## chas 0.051357463 0.035927002 -0.01879004 -0.06102304 0.054091077
## nox -0.064270060 0.327990930 0.43933045 0.36992087 -0.682424547
## rm -0.031831393 0.027532509 -0.07621914 -0.14238990 0.106374027
## age -0.470229341 1.450265293 -0.23689536 -1.46202224 -0.271967600
## dis 0.361412866 -0.313512208 0.20731668 -0.62537375 -0.395082557
## rad -1.643884814 -0.791845574 -3.25137506 -0.13663787 -0.481829789
## tax -9.099980888 -0.327782726 3.36875714 -0.64786007 1.410249515
## ptratio -0.008405428 0.060541663 -0.29843909 0.19223768 -0.102382405
## black 0.023126886 -0.089392096 -0.06586979 -0.19721961 0.616762841
## lstat -0.032547271 -0.183432824 -0.09333305 0.15316806 0.091791088
## medv -0.062060374 0.005458806 -0.02837714 0.34842936 -0.056421542
## LD6 LD7
## crim 0.098841166 0.03527449
## zn 0.842438878 0.35861794
## indus 0.904577426 -0.95614725
## chas -0.048494403 0.02316471
## nox -0.321696150 0.19808743
## rm 0.079509917 -0.04629482
## age 0.130244689 -0.20378571
## dis -0.476692936 -0.65750487
## rad 0.001318226 -1.00217642
## tax -0.281439177 0.85946009
## ptratio 0.336896632 0.79163091
## black 0.782771708 -0.43714271
## lstat 0.104232162 0.73179612
## medv 0.242827431 1.10579041
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6 LD7
## 0.9703 0.0155 0.0084 0.0029 0.0018 0.0009 0.0002
#### Super Bonus
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
##
## rename
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color = train$crime)
date()
## [1] "Fri Dec 1 09:04:13 2023"
library(tidyr)
library(dplyr)
library(corrplot)
library(GGally)
library(tibble)
library(ggplot2)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2
human <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/data/human.csv")
human_ <- column_to_rownames(human, "Country")
ggpairs(human_[-1] ,
lower = list(continuous = wrap("points", color = "turquoise", alpha = 0.5),
combo = wrap("box", color = "orange", alpha = 0.3),
discrete = wrap("facetbar", color = "yellow", alpha = 0.3) ),
diag = list(continuous = wrap("densityDiag", color = "red", alpha = 0.5) ))
corrplot(cor(human_[-1]))
pca_human <- prcomp(human_[-1])
s <- summary(pca_human)
round(1*s$importance[2, ], digits = 2)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 1 0 0 0 0 0 0 0
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
#### Here GNI seems responsible for nearly 100% of the variation, which seems too good to be true (and is indeed misleading). It is a good reminder to myself that scale matters; next we standardize the variables.
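The scale effect is easy to demonstrate: a single high-variance column dominates unscaled PCA. A self-contained sketch with made-up columns (the name big is illustrative, standing in for GNI):

```r
# One high-variance column dominates unscaled PCA
set.seed(7)
d <- data.frame(a = rnorm(50), b = rnorm(50), big = rnorm(50, sd = 1000))
p_raw <- prcomp(d)                 # unscaled: PC1 is essentially 'big'
p_std <- prcomp(d, scale. = TRUE)  # standardized: variance is shared
summary(p_raw)$importance[2, 1]    # proportion of variance on PC1, close to 1
summary(p_std)$importance[2, 1]    # closer to 1/3
```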
pca_human_sc <- prcomp(scale(human_))
s_sc <- summary(pca_human_sc)
pca_pr_z <- round(100*s_sc$importance[2, ], digits = 2)
pca_pr_z
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## 57.61 14.53 8.56 7.06 4.87 3.34 2.40 1.21 0.43
pc_lab <- paste0(names(pca_pr_z), " (", pca_pr_z, "%)")
biplot(pca_human_sc, cex = c(0.4, 0.6), col = c("grey40", "deeppink2"),xlab=pc_lab[1],
ylab=pc_lab[2])
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
view(tea)
keep_columns <- c("sex", "feminine", "effect.on.health", "healthy", "relaxing", "sugar", "exciting", "slimming", "sophisticated", "frequency")
tea_time <- select(tea, all_of(keep_columns))
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free")
# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)
# summary of the model
summary.MCA(mca)
##
## Call:
## MCA(X = tea_time, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.181 0.144 0.131 0.121 0.108 0.099 0.092
## % of var. 15.094 12.002 10.900 10.117 8.980 8.220 7.633
## Cumulative % of var. 15.094 27.096 37.996 48.113 57.093 65.313 72.946
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.084 0.074 0.060 0.054 0.053
## % of var. 7.017 6.173 4.973 4.491 4.400
## Cumulative % of var. 79.963 86.136 91.109 95.600 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.562 0.581 0.284 | -0.663 1.017 0.394 | 0.223
## 2 | 0.215 0.085 0.041 | -0.321 0.239 0.092 | 0.726
## 3 | -0.283 0.147 0.095 | -0.575 0.765 0.394 | 0.136
## 4 | 0.279 0.144 0.098 | -0.534 0.661 0.359 | -0.338
## 5 | 0.245 0.111 0.054 | -0.416 0.401 0.157 | 0.003
## 6 | 0.261 0.126 0.069 | -0.800 1.482 0.643 | 0.102
## 7 | 0.302 0.168 0.058 | -0.497 0.572 0.158 | 0.169
## 8 | -0.487 0.436 0.210 | 0.174 0.070 0.027 | -0.450
## 9 | 0.088 0.014 0.009 | -0.135 0.042 0.020 | -0.164
## 10 | 0.014 0.000 0.000 | -0.698 1.128 0.532 | -0.011
## ctr cos2
## 1 0.127 0.045 |
## 2 1.342 0.470 |
## 3 0.047 0.022 |
## 4 0.292 0.144 |
## 5 0.000 0.000 |
## 6 0.027 0.011 |
## 7 0.073 0.018 |
## 8 0.516 0.179 |
## 9 0.069 0.030 |
## 10 0.000 0.000 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## F | -0.513 8.612 0.384 -10.710 | 0.190 1.492 0.053
## M | 0.748 12.565 0.384 10.710 | -0.278 2.177 0.053
## feminine | -0.708 11.907 0.378 -10.636 | 0.459 6.299 0.159
## Not.feminine | 0.534 8.982 0.378 10.636 | -0.346 4.752 0.159
## effect on health | 0.547 3.629 0.084 5.020 | 0.975 14.512 0.268
## No.effect on health | -0.154 1.024 0.084 -5.020 | -0.275 4.093 0.268
## healthy | -0.296 3.383 0.204 -7.815 | -0.321 5.008 0.240
## Not.healthy | 0.690 7.893 0.204 7.815 | 0.749 11.686 0.240
## No.relaxing | 0.332 2.297 0.067 4.468 | 0.362 3.430 0.079
## relaxing | -0.201 1.388 0.067 -4.468 | -0.219 2.073 0.079
## v.test Dim.3 ctr cos2 v.test
## F 3.975 | 0.216 2.118 0.068 4.513 |
## M -3.975 | -0.315 3.090 0.068 -4.513 |
## feminine 6.898 | -0.019 0.012 0.000 -0.288 |
## Not.feminine -6.898 | 0.014 0.009 0.000 0.288 |
## effect on health 8.951 | 0.040 0.027 0.000 0.366 |
## No.effect on health -8.951 | -0.011 0.008 0.000 -0.366 |
## healthy -8.479 | -0.015 0.013 0.001 -0.406 |
## Not.healthy 8.479 | 0.036 0.030 0.001 0.406 |
## No.relaxing 4.868 | 0.888 22.697 0.476 11.933 |
## relaxing -4.868 | -0.536 13.715 0.476 -11.933 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.384 0.053 0.068 |
## feminine | 0.378 0.159 0.000 |
## effect.on.health | 0.084 0.268 0.000 |
## healthy | 0.204 0.240 0.001 |
## relaxing | 0.067 0.079 0.476 |
## sugar | 0.139 0.001 0.244 |
## exciting | 0.066 0.140 0.021 |
## slimming | 0.107 0.028 0.001 |
## sophisticated | 0.091 0.232 0.074 |
## frequency | 0.291 0.239 0.422 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali", palette = c("blue", "turquoise", "red"))
(more chapters to be added similarly as we proceed with the course!)